State-of-the-art automatic augmentation methods (e.g., AutoAugment and RandAugment) for visual recognition tasks diversify training data using a large set of augmentation operations. The magnitudes of many augmentation operations (e.g., brightness and contrast) vary over a continuous range. Therefore, to make search computationally tractable, these methods use fixed, manually defined magnitude ranges for each operation, which may lead to sub-optimal policies. To answer the open question of how important the magnitude range of each augmentation operation is, we introduce RangeAugment, which efficiently learns the range of magnitudes for individual as well as composite augmentation operations. RangeAugment uses an auxiliary loss based on image similarity to control the range of magnitudes of augmentation operations. As a result, RangeAugment has a single scalar search parameter, the target image similarity, which we optimize via linear search. RangeAugment integrates seamlessly with any model and learns model- and task-specific augmentation policies. With extensive experiments on the ImageNet dataset across different networks, we show that RangeAugment achieves performance competitive with state-of-the-art automatic augmentation methods while using 4-5 times fewer augmentation operations. Experimental results on semantic segmentation, object detection, foundation models, and knowledge distillation further show RangeAugment's effectiveness.
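As a rough illustration of the idea, the sketch below implements an image-similarity auxiliary loss of the kind described above, assuming PSNR as the similarity measure, brightness as the example operation, and a single learnable magnitude bound per operation; the exact loss form and parameterization used by RangeAugment may differ.

```python
# Minimal sketch of an image-similarity auxiliary loss in the spirit of RangeAugment.
# Assumptions (not spelled out in the abstract): PSNR as the similarity measure,
# brightness as the example operation, and one learnable magnitude bound.
import torch
import torch.nn.functional as F

def psnr(x, y, eps=1e-8):
    """Peak signal-to-noise ratio between images scaled to [0, 1]."""
    mse = F.mse_loss(x, y)
    return 10.0 * torch.log10(1.0 / (mse + eps))

class LearnableBrightness(torch.nn.Module):
    """Brightness augmentation whose magnitude bound is a learnable scalar."""
    def __init__(self, init_magnitude=0.5):
        super().__init__()
        self.magnitude = torch.nn.Parameter(torch.tensor(init_magnitude))

    def forward(self, x):
        # Sample a per-image scale in [1 - m, 1 + m]; differentiable w.r.t. m.
        u = torch.rand(x.size(0), 1, 1, 1, device=x.device)
        scale = 1.0 + (2.0 * u - 1.0) * self.magnitude
        return (x * scale).clamp(0.0, 1.0)

def augmentation_loss(original, augmented, target_similarity):
    # Penalize deviation of the observed similarity from the single scalar target,
    # so the magnitude bound adapts to the model and task during training.
    return (psnr(augmented, original) - target_similarity) ** 2

aug = LearnableBrightness()
images = torch.rand(8, 3, 224, 224)
loss = augmentation_loss(images, aug(images), target_similarity=30.0)
loss.backward()   # gradients flow into aug.magnitude
```

In use, an auxiliary term of this kind would be added to the task loss, and the scalar similarity target swept with a linear search as described above.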
Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with the latency of the network when deployed on a mobile device. Therefore, we perform an extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone, MobileOne, with an inference time under 1 ms on an iPhone 12 and 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance among efficient architectures while being much faster on mobile devices. Our best model obtains performance on ImageNet similar to MobileFormer while being 38x faster, and at similar latency our model obtains 2.3% better top-1 accuracy on ImageNet. Furthermore, we show that our model generalizes to multiple tasks - image classification, object detection, and semantic segmentation - with significant improvements in latency and accuracy compared to existing efficient architectures when deployed on a mobile device.
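Since the central observation above is that FLOPs and parameter counts can correlate poorly with on-device latency, here is a minimal sketch of directly measuring wall-clock latency instead. It uses torchvision's mobilenet_v3_small as a stand-in backbone (MobileOne itself is not assumed to be available) and measures on the host machine rather than on an iPhone, where Core ML tooling would be required.

```python
# Minimal sketch of measuring wall-clock inference latency rather than counting FLOPs.
import time
import torch
from torchvision.models import mobilenet_v3_small

model = mobilenet_v3_small().eval()     # stand-in for a mobile-friendly backbone
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(x)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    latency_ms = (time.perf_counter() - start) / runs * 1000.0

print(f"mean latency: {latency_ms:.2f} ms per image")
```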
Photorealistic rendering and reposing of humans is important for enabling augmented reality experiences. We propose a novel framework to reconstruct the human and the scene so that they can be rendered with novel human poses and views from just a single in-the-wild video. Given a video captured by a moving camera, we train two NeRF models: a human NeRF model and a scene NeRF model. To train these models, we rely on existing methods to estimate the rough geometry of the human and the scene. Those rough geometry estimates allow us to create a warping field from the observation space to a canonical, pose-independent space. Given a 10-second video clip, our framework provides high-quality renderings of the human in novel poses and from novel viewpoints, together with the background.
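A minimal sketch of the two-model setup described above: separate human and scene radiance fields, with observation-space sample points passed through a warping field into a pose-independent canonical space before querying the human model. The network sizes are toy values and the warp is a placeholder, since the actual warping field comes from the estimated human geometry.

```python
# Minimal sketch of querying separate human and scene NeRF models, warping
# observation-space points into a canonical space before the human query.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """A toy radiance field: 3D point -> (RGB, density)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, points):
        out = self.mlp(points)
        rgb = torch.sigmoid(out[..., :3])
        density = torch.relu(out[..., 3:])
        return rgb, density

def warp_to_canonical(points, pose_params=None):
    # Placeholder: a real warping field maps observation-space points to a
    # pose-independent space using the estimated human geometry and pose.
    return points

human_nerf, scene_nerf = TinyNeRF(), TinyNeRF()
pts = torch.randn(1024, 3)                         # samples along camera rays
h_rgb, h_sigma = human_nerf(warp_to_canonical(pts))
s_rgb, s_sigma = scene_nerf(pts)
# The two fields would then be composited along each ray during volume rendering.
```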
In visual retrieval systems, updating the embedding model requires recomputing the features for every piece of data. This expensive process is known as backfilling. Recently, the idea of backward compatible training (BCT) was proposed: to avoid the cost of backfilling, BCT modifies the training of the new model so that its representations are compatible with those of the old model. However, BCT can significantly hinder the performance of the new model. In this work, we propose a new learning paradigm for representation learning: forward compatible training (FCT). In FCT, when the old model is trained, we also prepare for a future, unknown version of the model. We propose learning side information, an auxiliary feature for each sample, that facilitates future updates of the model. To develop a powerful and flexible framework for model compatibility, we combine the side information with a forward transformation from the old embedding to the new embedding. The training of the new model is not modified, and hence its accuracy is not degraded. We demonstrate significant retrieval accuracy improvements over BCT on various datasets: ImageNet-1k (+18.1%), Places-365 (+5.4%), and VGG-Face2 (+8.3%). FCT obtains model compatibility when the models are trained across different datasets, losses, and architectures.
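A minimal sketch of the forward-transformation idea, assuming a small MLP that mixes the stored old embedding with its side information to predict the new model's embedding; the dimensions, architecture, and cosine objective are illustrative choices, not the paper's exact recipe.

```python
# Minimal sketch of a forward transformation in the spirit of FCT: it maps an old
# embedding plus per-sample side information into the new model's embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardTransform(nn.Module):
    def __init__(self, old_dim=256, side_dim=128, new_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(old_dim + side_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, new_dim),
        )

    def forward(self, old_emb, side_info):
        return self.net(torch.cat([old_emb, side_info], dim=-1))

transform = ForwardTransform()
old_emb = torch.randn(32, 256)       # gallery features from the old model
side = torch.randn(32, 128)          # side information stored alongside them
new_emb = torch.randn(32, 512)       # target: features from the new model

# Train the transform so transformed old features align with the new embedding
# space, so the gallery never needs to be re-extracted (no backfilling).
loss = 1.0 - F.cosine_similarity(transform(old_emb, side), new_emb).mean()
loss.backward()
```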
Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, in an unsupervised setting, typical training algorithms for controllable sequence generative models suffer from a training-inference mismatch: during training, the same sample is used as both the content and the style input, but unpaired samples are given at inference time. In this paper, we tackle this training-inference mismatch encountered in unsupervised controllable generative sequence models. The proposed method is simple yet effective: we use a style transformation module to transfer target style information into an unrelated style input. This method enables training with unpaired content and style samples and thereby mitigates the training-inference mismatch. We apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. We conduct a thorough evaluation, including quantitative and qualitative user studies. Our results show that by mitigating the training-inference mismatch with the proposed style equalization, we achieve style replication scores comparable to real data in our user studies.
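A minimal, heavily simplified sketch of the training setup implied above: instead of conditioning the generator on the target sample's own style, a style transformation module injects the target's style information into the embedding of an unrelated sample, so training sees the same unpaired conditioning as inference. The module internals below are placeholders, not the paper's architecture.

```python
# Minimal sketch of mitigating the training-inference mismatch with a style
# transformation module; encoder and mixer are illustrative placeholders.
import torch
import torch.nn as nn

style_encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))

class StyleEqualization(nn.Module):
    """Transfers the target's style information onto an unrelated style embedding."""
    def __init__(self, dim=16):
        super().__init__()
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, unrelated_style, target_style):
        return self.mix(torch.cat([unrelated_style, target_style], dim=-1))

equalize = StyleEqualization()
target = torch.randn(8, 64)       # sample whose content we want to reconstruct
unrelated = torch.randn(8, 64)    # unpaired sample providing the style "carrier"

style_in = equalize(style_encoder(unrelated), style_encoder(target))
# A sequence generator would now be conditioned on (content of `target`, style_in),
# matching the unpaired conditioning it will see at inference time.
```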
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms the group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in the 3D detection of pedestrians and cyclists using only LiDAR.
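A minimal sketch of a VFE-style layer as described above: a shared per-point fully connected layer, an element-wise max-pool over the points inside each voxel, and concatenation of the point-wise and aggregated features. The feature sizes and the 7-dimensional point encoding are illustrative assumptions.

```python
# Minimal sketch of a voxel feature encoding (VFE)-style layer.
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    def __init__(self, in_dim=7, out_dim=32):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim // 2)
        self.bn = nn.BatchNorm1d(out_dim // 2)

    def forward(self, points):
        # points: (num_voxels, max_points_per_voxel, in_dim)
        n, t, _ = points.shape
        pointwise = torch.relu(self.bn(self.fc(points.reshape(n * t, -1)))).reshape(n, t, -1)
        aggregated = pointwise.max(dim=1, keepdim=True).values       # per-voxel summary
        # Concatenate point-wise features with the voxel-wise aggregated feature.
        return torch.cat([pointwise, aggregated.expand(-1, t, -1)], dim=-1)

voxels = torch.randn(100, 35, 7)      # e.g. xyz, reflectance, offsets to voxel centroid
features = VFELayer()(voxels)         # (100, 35, 32)
```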
With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance due to a gap between synthetic and real image distributions. To reduce this gap, we propose Simulated+Unsupervised (S+U) learning, where the task is to learn a model to improve the realism of a simulator's output using unlabeled real data, while preserving the annotation information from the simulator. We develop a method for S+U learning that uses an adversarial network similar to Generative Adversarial Networks (GANs), but with synthetic images as inputs instead of random vectors. We make several key modifications to the standard GAN algorithm to preserve annotations, avoid artifacts, and stabilize training: (i) a 'self-regularization' term, (ii) a local adversarial loss, and (iii) updating the discriminator using a history of refined images. We show that this enables generation of highly realistic images, which we demonstrate both qualitatively and with a user study. We quantitatively evaluate the generated images by training models for gaze estimation and hand pose estimation. We show a significant improvement over using synthetic images, and achieve state-of-the-art results on the MPIIGaze dataset without any labeled real data.
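A minimal sketch of the refiner objective outlined above: a local (patch-wise) adversarial term that asks the discriminator to classify refined images as real, plus the 'self-regularization' term that keeps the refined image close to the synthetic input so annotations are preserved. The networks and the weighting are placeholders, and the refined-image history buffer used for discriminator updates is omitted.

```python
# Minimal sketch of an S+U refiner loss: local adversarial term + self-regularization.
import torch
import torch.nn as nn
import torch.nn.functional as F

refiner = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 1, 3, padding=1))
# A "local" discriminator: a fully convolutional net whose output is a map of
# per-patch realism logits rather than a single scalar per image.
discriminator = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                              nn.Conv2d(16, 1, 3, padding=1))

synthetic = torch.rand(4, 1, 35, 55)              # simulator output (e.g. eye images)
refined = refiner(synthetic)

patch_logits = discriminator(refined)
adv_loss = F.binary_cross_entropy_with_logits(patch_logits,
                                              torch.ones_like(patch_logits))
self_reg = F.l1_loss(refined, synthetic)          # 'self-regularization' term
lambda_reg = 0.1                                  # illustrative weighting
refiner_loss = adv_loss + lambda_reg * self_reg
refiner_loss.backward()
```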
We propose the coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images. In contrast to existing approaches, which require tuples of corresponding images in different domains in the training set, CoGAN can learn a joint distribution without any tuple of corresponding images: it needs only samples drawn from the marginal distributions. This is achieved by enforcing a weight-sharing constraint that limits the network capacity and favors a joint-distribution solution over a product of marginal distributions. We apply CoGAN to several joint distribution learning tasks, including learning a joint distribution of color and depth images, and learning a joint distribution of face images with different attributes. For each task it successfully learns the joint distribution without any tuple of corresponding images. We also demonstrate its applications to domain adaptation and image transformation.

To overcome the limitation of requiring corresponding image tuples, we propose the coupled generative adversarial networks (CoGAN) framework. It can learn a joint distribution of multi-domain images without corresponding images from different domains existing in the training set; only a set of images drawn separately from the marginal distributions of the individual domains is required. CoGAN is based on the generative adversarial networks (GAN) framework [5], which has been established as a viable solution for image distribution learning tasks, and extends GAN to joint image distribution learning. CoGAN consists of a tuple of GANs, one for each image domain. When trained naively, CoGAN learns a product of marginal distributions rather than a joint distribution. We show that by enforcing a weight-sharing constraint, CoGAN can learn a joint distribution without corresponding images from different domains existing in the training set. The CoGAN framework is inspired by the idea that deep neural networks learn a hierarchical feature representation. By forcing the layers that decode high-level semantics in the GANs to share weights, it forces the GANs to decode the high-level semantics in the same way. The layers that decode low-level details then map the shared representation to images in the individual domains so as to confuse the respective discriminative models. CoGAN handles multiple image domains, but for ease of presentation we focus on the case of two image domains in the paper; the discussion and analysis generalize readily to more domains.
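A minimal sketch of the weight-sharing constraint: the two generators share the layers that decode high-level semantics and keep separate, domain-specific layers that decode low-level details, so a single latent code yields a corresponding pair of images. Layer sizes and the fully connected layout are illustrative.

```python
# Minimal sketch of CoGAN-style weight sharing between two generators.
import torch
import torch.nn as nn

shared = nn.Sequential(                 # shared layers decoding high-level semantics
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
)
head_a = nn.Linear(512, 28 * 28)        # domain-A specific low-level decoding layers
head_b = nn.Linear(512, 28 * 28)        # domain-B specific low-level decoding layers

z = torch.randn(16, 100)                # a single noise vector feeds both generators
hidden = shared(z)
img_a = torch.tanh(head_a(hidden)).view(16, 1, 28, 28)
img_b = torch.tanh(head_b(hidden)).view(16, 1, 28, 28)
# Each image is scored by its own discriminator; because `shared` is common to both
# generators, corresponding samples (img_a, img_b) are tied through the same semantics.
```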